ABSTRACT

The equivalence of pencil and paper Rasch item 
calibrations when used in a computer adaptive test administration was 
explored in this study. Items (n=726) were precalibrated with pencil
and paper test administrations to 321 medical technology students. A
computer adaptive test was administered to 1,187 students using the pencil
and paper precalibrations in the item selection algorithms and in the
computation of examinee ability estimates. The response data from the 
computer adaptive test administration were analyzed yielding 
recalibrated item difficulties and examinee ability estimates. Item 
precalibrations were compared with item recalibrations. Examinee 
ability estimates obtained using the item precalibrations on the 
computer adaptive administration were compared with the examinee 
ability estimates obtained from using the item recalibrations on the 
computer adaptive administration. The correlation for examinee
ability estimates was 0.99 and for item calibrations it was 0.90.

Some item calibrations shifted but most remained consistent within 
the limits of error. Item shift, however, did not affect the ordering 
of examinee ability estimates. (Contains 1 table, 3 figures, and 23 
references.) (Author/SLD)

Equivalence of Rasch Item Calibrations and Ability Estimates
Across Modes of Administration 



Abstract 

The purpose of this paper is to explore the equivalence of pencil 
and paper Rasch item calibrations when used in a computer adaptive test 
administration. Items were precalibrated with pencil and paper test 
administrations. A computer adaptive test was administered using the 
pencil and paper precalibrations in the item selection algorithm and in 
the computation of examinee ability estimates. The response data from 
the computer adaptive test administration were analyzed yielding 
recalibrated item difficulties and examinee ability estimates. Item 
precalibrations were compared with item recalibrations. Examinee ability 
estimates obtained using the item precalibrations on the computer 
adaptive administration were compared with the examinee ability estimates 
obtained using the item recalibrations on the computer adaptive 
administration. The correlation for examinee ability estimates was .99 
and for item calibrations was .90. Some item calibrations shifted but 
most remained consistent within the limits of error. Item shift, 
however, did not affect the ordering of examinee ability estimates. 



Key words: CAT, Rasch model, item calibration



Equivalence of Rasch Item Calibrations and Ability Estimates 
Across Modes of Administration 



When a computer test is adaptive, a participant is administered 
items based on a current ability estimate. When an item is answered
correctly, a higher ability estimate is calculated and a more difficult item is
presented. When an item is answered incorrectly, a lower ability 
estimate is calculated and an easier item is presented. The most 
informative and hence the most useful items are presented to each 
examinee so that responses to fewer items are required to achieve the 
same level of precision. In order for an item to be used efficiently 
with the computer adaptive algorithm, it must be precalibrated using a 
latent trait model such as the Rasch model which orders items from easy 
to difficult. A pencil and paper administration or a previous computer 
adaptive administration can be used for item precalibration. 
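
The tailoring logic described above can be illustrated with a minimal sketch. It assumes the standard dichotomous Rasch model, in which the probability of a correct response depends only on the difference between examinee ability and item difficulty (both in logits) and the information an item provides is p(1 - p). The function names are illustrative and are not taken from the software used in this study.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Information an item provides at a given ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def most_informative_item(ability, difficulties):
    """Index of the item whose difficulty yields maximum information."""
    return max(
        range(len(difficulties)),
        key=lambda i: item_information(ability, difficulties[i]),
    )
```

Information peaks when item difficulty equals the current ability estimate, which is why an item calibration that is off by a few tenths of a logit can lead the algorithm to present a less useful item.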

Many organizations have item pools calibrated from previous pencil 
and paper administrations. However, the use of these calibrations for a 
computer adaptive test needs careful consideration. Since the mode of 
administration is different, there is a possibility that items are
somehow "different" when presented on a computer instead of on a piece of
paper. If items are "different", pencil and paper calibrations may not
be appropriate for a computer adaptive test. In a computer adaptive test 
each examinee takes an individualized test. Items are presented to 
different examinees in different contexts and at different points during 
the test administration. Thus context effects and location effects are 
potentially different for each examinee. In a paper and pencil test, item
location and context do not fluctuate. If the location and/or context 
affect the item calibration, the paper and pencil calibration may not be 
appropriate for a computer adaptive test. 

The possibility that item precalibrations might change due to the 
mode of administration, namely, paper and pencil vs. computer adaptive, 
has been discussed by several researchers (Kingsbury and Houser, 1989;
Wise et al., 1989). Green, Bock, Humphreys, Linn and Reckase
(1984) suggest several possible problems which might arise when items 
for a computer adaptive test are precalibrated using data from a paper 
and pencil test. An overall shift might occur, such that all items 
become easier or harder, or an "item by mode interaction" might occur 
where some, but not all, item parameters change. They postulate that 
items with diagrams or many lines of text may be most vulnerable to an 
item by mode interaction. 

Context effects have been addressed by Kingston and Dorans (1984). 
They note that the appropriateness of IRT equating based on 
precalibration requires that changes in position of items in a test 
between the preoperational calibration and operational administrations of 
the test have no effect on item parameter estimates. They found some 
types of complex items, especially those which require extensive 
instructions, to be particularly sensitive to location effects and thus 
possibly unsuitable for computer adaptive administration. Yen (1980) 
also found item characteristics to be affected by the sequence in which 
items were administered. 

One of the consequences of targeting items to the ability level of
the examinee is that examinees of different ability levels may be 
presented with items in different difficulty order. Folk (1990) points 
out that a high ability examinee will generally answer the initial items 
of a computer adaptive test correctly and then will receive more 
difficult items. This results in his test being structured from easy to 
hard. A low ability examinee will not answer the initial items correctly 
which results in his test being structured from hard to easy. However, 
Folk found that the administration of items in different orders did not 
affect substantially the performance of low or high ability examinees. 

Other potential problems in precalibrating items with a pencil and 
paper test for computer adaptive administration have been addressed by 
Wainer and Kiely (1987). One of these is the differential effect of
cross information encountered in computer adaptive testing. If a paper 
and pencil item provides a cue for another item, all examinees receive 
the same cue. With a computer adaptive test, examinees are administered 
different items and items are ordered differently. If an item 
calibration is influenced by a cueing effect in a pencil and paper
administration, it may be invalid for the computer adaptive
administration. They also point out that one of the virtues of computer
adaptive testing, short test length, may become problematic if item
calibrations are unstable. Since the shorter test lacks the redundancy 
of a conventional test it will be more vulnerable to idiosyncrasies of 
item performance. 

If items have not been precalibrated, an initial pencil and paper
administration may be practical for those considering computer adaptive 
testing. In this case, the size and composition of the sample needed for 
precalibration of items must be considered. It has been suggested that the 
sample include a minimum of 1,000 respondents and be comparable to the target 
population (Rudner, 1989; Green et al., 1984). However, it may be difficult
to amass a comparable sample population this large. 

Computer adaptive testing has been shown to reduce test length without 
loss of precision (Weiss, 1983, 1985; Weiss and Kingsbury, 1984; McKinley and 
Reckase, 1980, 1984; Olsen et al., 1986). When the items presented are
targeted or tailored to the ability of the examinee, fewer items are required 
to estimate ability or reach a pass/fail decision. This assumes that the item 
difficulty calibrations are equivalent regardless of the mode of 
administration under which the calibrations were obtained. 

Purpose 

The purpose of this paper is to explore two related issues to determine 
whether item precalibrations from pencil and paper tests are appropriate for 
use in computer adaptive testing. The first issue is the equivalence of item 
calibrations from paper and pencil and computer adaptive administrations. The 
second issue is the equivalence of examinee ability estimates when item 
precalibrations from paper and pencil tests versus item calibrations from 
computer adaptive tests are used for the tailoring algorithm. 

Method

Precalibration 

Three hundred and twenty-one medical technology students, from 57 
educational (training) programs across the country provided data for the 
precalibration of 726 items. To participate, students had to be eligible to 
take the first semi-annual administration of the related certification 
examination. 

Each student took one of four different forms of a 200 item conventional 
pencil and paper test. Each form included a subset of common items for 
equating so that all forms could be placed on the same scale. Form 1 was 
taken by 73 students, Form 2 by 86 students, Form 3 by 71 students, and Form 4 
by 91 students. Each of the four forms was calibrated by the Rasch model 
program MSCALE (Wright, Congdon and Schultz, 1987). The analysis of the 
infit (footnote 1), a statistic designed to assess the suitability of the data for
constructing ability estimates, revealed that low ability individuals did not 
answer items correctly more than statistically expected. This confirmed that 
the precalibrations could be used in the CAT algorithm. The forms were 
equated using common item equating (Wright and Stone, 1979) and item 
precalibrations for the 726 items were established. 
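
Common item equating can be sketched as follows. This is a minimal illustration, assuming that the equating constant is the mean difference in difficulty of the items shared by two forms, as is usual for Rasch calibrations; the dictionaries of item difficulties and the function names are hypothetical and do not reproduce the MSCALE procedure.

```python
from statistics import mean


def equating_constant(base_form, new_form):
    """Mean difficulty difference (base minus new) over the common items."""
    common = [item for item in base_form if item in new_form]
    if not common:
        raise ValueError("the two forms share no common items")
    return mean(base_form[item] - new_form[item] for item in common)


def equate_form(new_form, shift):
    """Place every difficulty of the new form on the base form's scale."""
    return {item: difficulty + shift for item, difficulty in new_form.items()}
```

Applying the shift to every item of a form, including its unique items, places all forms on one scale so that the calibrations can be pooled.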

CAT Administration 

The computer adaptive test administration used 1187 students from 238
educational programs across the country. To participate, students had to be
eligible to take the second semi-annual administration of the related
certification examination. Programs that participated in the precalibration
were not eligible to participate in the computer adaptive test. The two
samples were considered comparable.

1 The infit statistic is the information weighted mean-square residual that is
sensitive to an accumulation of central or inlying deviations. The expected
value for the mean squares is one (1.0) and their asymptotic standard errors
are approximately the square root of (2/df) where df is the number of
independent replications on which the corresponding estimate is based.
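
The infit statistic defined in footnote 1 can be written out as a short sketch. It assumes dichotomous responses and model probabilities already computed from the Rasch model; the function names are illustrative only.

```python
import math


def infit_mean_square(responses, probabilities):
    """Information weighted mean-square residual.

    Sum of squared residuals divided by the sum of the model variances
    p * (1 - p), so well-targeted observations carry the most weight.
    """
    squared_residuals = sum(
        (x - p) ** 2 for x, p in zip(responses, probabilities)
    )
    information = sum(p * (1.0 - p) for p in probabilities)
    return squared_residuals / information


def approximate_standard_error(df):
    """Asymptotic standard error of the mean square: sqrt(2 / df)."""
    return math.sqrt(2.0 / df)
```

Values near 1.0 indicate that the responses are about as variable as the model expects; values well above 1.0 indicate unexpected responses on well-targeted items.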

The computer adaptive testing model used in this study has the following 
characteristics. It is designed as a mastery model (Weiss and Kingsbury,
1984) to determine whether a person's estimated ability level is above or
below a pre-established criterion expressed in the metric (logits) of the
calibrated item pool scale. Kingsbury and Houser (1990) have shown that an
adaptive testing procedure which provides maximum information about the
examinee's ability will provide a clearer indication that the examinee is
above or below the pass/fail point than a test which peaks the information at 
the pass/fail point. 

The test was stopped when the examinee ability estimate was either 1.3
times the error of measure above the pass/fail point (a clear pass at a
one-tailed 90% confidence level) or 1.3 times the error of measure below the
pass/fail point (a clear fail), or when a maximum test length of 240 items
(footnote 2) was reached. Minimum test length was 50 items and the pass/fail
point was set at .15 logits.
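
The stopping rule can be summarized in a short sketch. The constants come from the description above; the function name, the return labels, and the assumption that the minimum test length is enforced before the confidence check are illustrative. The text does not state how a decision was reached at the maximum length, so the sketch only reports that the limit was hit.

```python
PASS_POINT = 0.15   # pass/fail point in logits
MARGIN = 1.3        # multiples of the error of measure (one-tailed 90%)
MIN_ITEMS = 50
MAX_ITEMS = 240


def stopping_decision(ability, error_of_measure, items_administered):
    """Return 'continue', 'pass', 'fail', or 'stop_at_maximum_length'."""
    if items_administered < MIN_ITEMS:
        return "continue"
    if ability - PASS_POINT >= MARGIN * error_of_measure:
        return "pass"   # clear pass
    if PASS_POINT - ability >= MARGIN * error_of_measure:
        return "fail"   # clear fail
    if items_administered >= MAX_ITEMS:
        return "stop_at_maximum_length"
    return "continue"
```

For example, an examinee at .80 logits with an error of measure of .20 after 60 items is a clear pass, because .80 - .15 = .65 exceeds 1.3 x .20 = .26.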

The Rasch model (Rasch, 1960/1980) was used to calibrate items and
estimate person ability. The Rasch model was selected because the sample
sizes were not large enough to meet the requirements of 2 and 3 parameter
models (Lord, 1983) and there is evidence that the examinee abilities
estimated with the Rasch and the 2 and 3 parameter models correlate highly
(.90) when tests are administered under a computer adaptive algorithm
(Olsen et al., 1986). Since the goal is accurate examinee ability estimates,
the Rasch model was able to provide sufficient and complete information. The
PROX version of the maximum likelihood method of item selection (Wright and
Stone, 1979) was used in the adaptive algorithm.

2 In this study 240 items was set as the maximum because it is comparable to
the current paper and pencil test. The intent was to verify that fewer items
were necessary, but this required that a relatively high number of items be
allowed.

The calibrated item pool contained 726 multiple choice (4-choice) items 
from six subsets. Table 1 provides a summary of the item pool distribution. 
Content coverage was designed to be comparable to the test specifications for 
the conventional paper and pencil certification examination and a content 
balancing mechanism of the type described by Kingsbury and Zara (1989) was 
included in the item selection algorithm. In the first 50 items, blocks of 
ten items were administered from subsets 1-4 and blocks of 5 items were 
administered from subsets 5 and 6. After 50 items, blocks of 4 items (subsets 
1-4) and blocks of 2 items (subsets 5 and 6) were administered. Subset order 
was selected randomly by the computer algorithm. Maurelli and Weiss (1981) 
found subtest order to have no effect on the psychometric properties of an 
existent achievement test battery. 
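
The content balancing schedule can be sketched as below. Block sizes follow the description above (the first pass through the six subsets yields exactly 50 items; each later pass yields 20). The assumption that subset order is re-randomized on every pass, and the function name, are illustrative; the text says only that subset order was selected randomly.

```python
import random


def block_schedule(total_items, rng=None):
    """List the content subset (1-6) assigned to each item position."""
    rng = rng or random.Random()
    schedule = []
    first_pass = True
    while len(schedule) < total_items:
        order = [1, 2, 3, 4, 5, 6]
        rng.shuffle(order)  # random subset order for this pass
        for subset in order:
            if first_pass:
                size = 10 if subset <= 4 else 5   # first 50 items
            else:
                size = 4 if subset <= 4 else 2    # after 50 items
            schedule.extend([subset] * size)
        first_pass = False
    return schedule[:total_items]
```

Because the test can stop at any point after item 50, the final pass may be incomplete; the slice at the end of the function reflects that.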

INSERT TABLE 1 ABOUT HERE 




The computer adaptive test administrator program (Gershon, 1989) 
implemented the adaptive algorithm and the content balancing requirements. 
Items were chosen at random from unused items within .10 logits of the 
targeted item difficulty within the specified content area. While the 
examinee considered the item presented, the computer selected two items, one 
which would yield maximum information should the current item be answered 
incorrectly and another which would yield maximum information should the 
current item be answered correctly. This ensured that there was no lag time
before the next item was presented. Examinees were allowed 4 hours to
complete the test.
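
The item selection step can be sketched as follows. This is not the administrator program itself (Gershon, 1989); the pool layout, the function names, and the two target difficulties passed in are hypothetical. The sketch returns None when no unused item lies within the window, since the text does not say how that case was handled.

```python
import random


def pick_item(target, pool, used, subset, rng, window=0.10):
    """Randomly choose an unused item within `window` logits of the target.

    `pool` maps item id -> (difficulty in logits, content subset).
    """
    candidates = [
        item_id
        for item_id, (difficulty, item_subset) in pool.items()
        if item_subset == subset
        and item_id not in used
        and abs(difficulty - target) <= window
    ]
    return rng.choice(candidates) if candidates else None


def preselect_next(target_if_correct, target_if_incorrect,
                   pool, used, subset, rng):
    """Choose both possible next items while the examinee is still working."""
    return {
        "if_correct": pick_item(target_if_correct, pool, used, subset, rng),
        "if_incorrect": pick_item(target_if_incorrect, pool, used, subset, rng),
    }
```

Selecting both candidates in advance means the next item can be displayed as soon as the response is scored.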

Recalibration from CAT Administration

To determine the equivalence of item calibrations and to determine
whether shifts in item calibration affect examinee ability estimates, the 
response data from the computer adaptive test administration were 
recalibrated. Each computer adaptive test yielded an examinee response 
string. While the entire item pool consisted of 726 items, each examinee
response string contained responses to between 50 items (minimum test
length) and 240 items (maximum test length). Each item had a unique
identifying number. Response strings from all examinees were appended, 
resulting in a file containing an 1187 (examinee) by 726 (item) matrix with 
missing data for all items not presented to a particular examinee. 

Figure 1 shows the frequency of item use on the CAT compared with the 
difficulty of the item precalibration. The mean number of examinees to whom 
an item in the CAT was administered was 161.62 with a standard deviation of 
88.43. The minimum number of examinees was 13 and the maximum number of 
examinees was 382. Items with calibrations between -1 and 1 logits were 
administered most frequently. Thus the number of examinees used to 
recalibrate each item after the CAT administration varied considerably. 
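
The assembly of the response file can be illustrated with a small sketch. It assumes each response string is a list of (item number, score) pairs; None marks items not presented to an examinee. The names are illustrative and the data layout is an assumption, not the file format actually used.

```python
def build_response_matrix(response_strings, item_ids):
    """Build an examinee-by-item matrix with None for unadministered items."""
    column = {item_id: j for j, item_id in enumerate(item_ids)}
    matrix = []
    for responses in response_strings:
        row = [None] * len(item_ids)
        for item_id, score in responses:
            row[column[item_id]] = score
        matrix.append(row)
    return matrix


def administrations_per_item(matrix):
    """Number of examinees who were presented each item."""
    n_items = len(matrix[0])
    return [
        sum(1 for row in matrix if row[j] is not None)
        for j in range(n_items)
    ]
```

The per-item counts are the sample sizes on which each recalibration rests, which in this study ranged from 13 to 382.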



INSERT FIGURE 1 ABOUT HERE 



The 1187 by 726 response matrix was analyzed with BIGSCALE (Wright,
Linacre, and Schultz, 1989), an update of the MSCALE program which processes
large data sets that have missing data. Examinee ability estimates were not 
held constant but recalculated based on the item recalibrations. This 
procedure produced a new set of item calibrations and a new set of examinee 
ability estimates based upon responses from the CAT administration only. The 
fit of the item recalibrations to the model was reviewed, and the items were
found to be suitable for constructing ability estimates. The infit statistic 
for all items indicated that low ability examinees did not get items correct 
more frequently than statistically expected. 

Comparison of Item Calibrations

The item precalibrations, obtained from the pencil and paper test, were 
compared with the item recalibrations from the CAT administration. Item 
precalibrations were based on sample sizes of 71 to 91 examinees. Item 
recalibrations were based on sample sizes of 13 to 382 examinees. Summary
statistics and correlations were calculated. Then the recalibrated item
distribution was adjusted for the difference in means and standard deviations 
and the logit differences were calculated. 
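
The comparison can be sketched as follows, assuming a linear adjustment that gives the recalibrations the same mean and standard deviation as the precalibrations before item-by-item differences are taken. The function names are illustrative, and the exact adjustment used in the study may have differed in detail.

```python
from statistics import mean, stdev


def rescale_to(reference, values):
    """Linearly rescale `values` to the mean and SD of `reference`."""
    scale = stdev(reference) / stdev(values)
    center = mean(values)
    return [mean(reference) + scale * (v - center) for v in values]


def logit_differences(precalibrations, recalibrations):
    """Adjusted recalibration minus precalibration, item by item."""
    adjusted = rescale_to(precalibrations, recalibrations)
    return [a - p for a, p in zip(adjusted, precalibrations)]


def pearson_correlation(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

After the adjustment, any remaining differences reflect changes in the relative position of individual items rather than an overall change in scale.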

Comparison of Ability Estimates 

During the CAT administration the computer adaptive algorithm calculated 
examinee ability estimates based on the pencil and paper precalibrated items. 
In the recalibration, examinee ability estimates and item recalibrations were 
calculated simultaneously from the response data collected during the CAT 
administration. The examinee ability estimates obtained in the CAT 
administration using the paper and pencil item precalibrations were compared 
with the examinee ability estimates obtained after the item recalibration. 
Summary statistics, correlations and logit differences were calculated. 
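
To illustrate how one response string yields two ability estimates, the sketch below estimates ability by Newton-Raphson maximum likelihood under the Rasch model. This is a generic estimator swapped in for illustration; it is neither the PROX procedure used during the CAT administration nor the BIGSCALE algorithm, and the function name is hypothetical.

```python
import math


def estimate_ability(responses, difficulties, iterations=25):
    """Maximum likelihood ability estimate and its error of measure."""
    score = sum(responses)
    if score == 0 or score == len(responses):
        raise ValueError("no finite estimate for a zero or perfect score")
    theta = 0.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties]
        information = sum(p * (1.0 - p) for p in probs)
        theta += (score - sum(probs)) / information  # Newton-Raphson step
    probs = [1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties]
    information = sum(p * (1.0 - p) for p in probs)
    return theta, 1.0 / math.sqrt(information)
```

Calling the function twice with the same responses, once with the precalibrated difficulties and once with the recalibrated difficulties, gives the pair of estimates whose difference is examined here.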

Results 

Item Calibrations 

The correlation for pencil and paper item precalibrations and the
computer adaptive test recalibrations was .90, indicating that some shift did
occur for some of the items (see Figure 2). Precalibrations ranged from -3.61
to 3.84 logits, with a mean of -.02 and a standard deviation of 1.00.
Recalibrations ranged from -3.84 to 3.60 with a mean of 0.00 and a standard
deviation of 1.22.

INSERT FIGURE 2 ABOUT HERE

Two types of shift occurred. The first is an overall shift, indicated 
by the difference in the mean and standard deviation of the item 
precalibrations compared to the mean and standard deviation of the item 
recalibrations. The spread of the item recalibrations (S.D. = 1.22) is wider
than the spread of the item precalibrations (S.D. = 1.00). The effect of the
difference in standard deviations is that the hard items appear harder when 
recalibrated and the easy items appear easier. 

The second type of shift occurred with specific items. After the 
distribution of the recalibrated items is adjusted for differences in the two 
means and standard deviations, some item calibrations still shift and the 
order of item difficulty is altered. A few items recalibrate as more 
difficult than they did originally on the precalibration and a few items 
recalibrate as less difficult. 

The shifts in item calibrations from precalibration (small sample,
pencil and paper administration) to recalibration (varying sample per item,
computer adaptive administration) may be due to the mode of administration or
to item bias (a difference in the intent or preparation between the
precalibration sample population and the recalibration sample population).
For example, of the 7 items with the largest shift in the direction of easier
on the recalibration, 5 were from the same content area, indicating possible
differential preparation between the two sample populations.

Examinee Ability Estimates

The mean ability estimate calculated with precalibrated items was .21 
with a standard deviation of .51. The mean ability estimate calculated with 
recalibrated items was .24 with a standard deviation of .58. The mean logit 
difference between ability estimates was -.03 with a standard deviation of 



The correlation of the examinee ability estimates calculated using the 
paper and pencil item precalibrations and the examinee ability estimates 
calculated using computer adaptive item recalibrations was .99. Figure 3 
shows that there is virtually no difference between the examinee ability
estimates using item precalibrations (paper and pencil) or recalibrations
(computer).



In this study, even though the precalibrations were obtained from a 
pencil and paper administration with relatively few participants, most of the 
Rasch item calibrations remained stable when recalibrated from the computer 
adaptive administration. The results demonstrate that items precalibrated in 
a pencil and paper administration can be used for computer adaptive tests. The 
item calibrations were equivalent given varying numbers of examinees,
different contexts, and varying modes of administration. The pencil and paper
precalibrations used a sample of examinees of varying ability levels so each
item was calibrated from a range of examinee abilities. Items on the computer
adaptive administration were targeted to the examinee ability, so
the recalibrations were based on a smaller range of ability levels.

INSERT FIGURE 3 ABOUT HERE

Two types of shifts occurred in the item calibrations. The first type,
an overall shift in mean and standard deviation, can be corrected for by using 
an equating transformation. The second type of shift, a shift in the
calibration of certain items, is potentially more problematic because examinees
take different items. This means that when some items shift, some examinees 
may be differentially affected depending upon how many of the shifted items 
are presented to them. 

The examinee ability estimates correlation of .99 indicates that even 
though a small percentage of the item calibrations shift, the examinee ability 
estimates are not affected. No examinee ability estimates differed beyond the 
variance expected due to error of measure. However, if shift in item 
calibration is a concern, the items can be identified and revised or discarded 
for subsequent CAT administrations. The examinee ability estimates, however,
can be considered valid even if it is necessary to re-evaluate some items. 
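
One way to identify items whose calibrations shift beyond the limits of error is sketched below. It assumes that each calibration comes with a standard error and flags an item when the difference between its two calibrations exceeds a chosen multiple of their joint standard error. The threshold of 2.0, the data layout, and the function name are assumptions for illustration; the paper does not report the criterion it applied.

```python
import math


def flag_shifted_items(precalibrations, recalibrations, threshold=2.0):
    """Return ids of items whose shift exceeds `threshold` joint errors.

    Both arguments map item id -> (difficulty in logits, standard error).
    """
    flagged = []
    for item_id, (d_pre, se_pre) in precalibrations.items():
        if item_id not in recalibrations:
            continue
        d_post, se_post = recalibrations[item_id]
        joint_error = math.sqrt(se_pre ** 2 + se_post ** 2)
        if abs(d_post - d_pre) / joint_error > threshold:
            flagged.append(item_id)
    return flagged
```

Flagged items are candidates for review, revision, or removal before subsequent CAT administrations; the recalibrations should first be placed on the precalibration scale so that an overall shift is not mistaken for item-level shift.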

This study confirms that organizations desiring to utilize the 
technology of the computer for administering tests can do so without the 
expense of recalibrating their item pools. Of course, the item pool must be 
continually monitored for drift, validity and quality of item content whether 
tests are administered in a paper and pencil or computer adaptive mode. 

Figure 2

Comparison of Item Calibrations

Figure 3

Comparison of Ability Estimates